B. R. Rau, M. Lee, P. P. Tirumalai, and M. S. Schlansker. Register allocation for software pipelined loops. In Proceedings 
of the ACM SIGPLAN '92 Conference on Programming Language Design and Implementation, pages 283-299, San 
Francisco, California, June 17-19, 1992. SIGPLAN Notices, 27(7), July 1992. 
Advanced Vector Architectures - Espasa (1997) 



....needed. On top of that, program transformations such as loop blocking [PHH89, WL91, KM92, LRW91b, Li95, CM95] have 
proven very useful to fit the working set of a program into multilevel memory hierarchies. Related to data 
caching, software pipelining [Lam88, GHW90, GAG94, Jai91, RLTS92, Ram94, Rau94] has also contributed to hiding 
memory latency and the penalties associated with cache misses by overlapping several iterations of a single loop. 

Decoupling. Decoupled scalar processors [SWP86, Smi84, KHC94] have focused on numerical computation and attack the 
memory latency problem .... 

B. R. Rau, M. Lee, P. P. Tirumalai, and M. S. Schlansker. Register allocation for software pipelined loops. In Proceedings of the 
ACM SIGPLAN '92 Conference on Programming Language Design and Implementation, pages 283-299, San Francisco, 
California, June 17-19, 1992. SIGPLAN Notices, 27(7), July 1992. 
....more simultaneously live values exist than physical registers, spill code must be added and can significantly increase the 
achieved II of the loop. In this case, it may be possible to achieve a better final II by increasing the candidate II and 
attempting to schedule the original loop body again [26]. If a lower bound on the loop's final register requirement for a given II 
were available, it would be useful during both optimization and scheduling. During optimization it could be used to stop 
optimization before excessive register pressure is generated. During scheduling, the candidate II's .... 

B. R. Rau, M. Lee, P. P. Tirumalai, and M. S. Schlansker, "Register allocation for software pipelined loops," in Proceedings of the 
ACM SIGPLAN 92 Conference on Programming Language Design and Implementation, pp. 283-299, June 1992. 
....the successive outputs of an operation can be kept in distinct registers. In the absence of hardware support, the loop may 
be unrolled and the duplicate register specifiers renamed appropriately [9]. However, this modulo variable expansion 
technique can result in a large amount of code expansion [13]. A rotating register file can solve this problem without 
duplicating code. Consider saving the series of values generated by an operation in its own infinite pushdown stack. Old values 
can be read out of anywhere in the stack, and new values can be pushed on top, but a value cannot be modified .... 

....around a vector of length II. In any case, the LiveVector's maximum, MaxLive, is the desired lower bound. Allocating registers 
for a modulo scheduled loop is beyond the scope of this paper. For an extensive discussion of the problem, including 
heuristic solutions and empirical results, consult [18]. One of the most remarkable results reported in that paper is the ability 
of their allocation strategies to almost always achieve the MaxLive lower bound on a schedule's register pressure. Due to that 
result, this paper approximates a schedule's register pressure with its MaxLive lower bound. 
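
The MaxLive bound mentioned in the excerpt above can be illustrated with a minimal Python sketch. It is not taken from the cited papers: the function name `max_live`, the representation of each lifetime as a half-open `(start, end)` cycle interval, and the sample data are all assumptions made for illustration. Each lifetime is wrapped around a vector of length II, and the largest entry of that vector is reported as MaxLive.

```python
from typing import List, Tuple


def max_live(lifetimes: List[Tuple[int, int]], ii: int) -> Tuple[int, List[int]]:
    """Return (MaxLive, LiveVector) for lifetimes wrapped around II slots.

    Assumption: each lifetime is a half-open interval [start, end), measured
    in cycles of the flat (unwrapped) schedule of a single iteration.
    """
    if ii <= 0:
        raise ValueError("II must be positive")
    live_vector = [0] * ii
    for start, end in lifetimes:
        # A value live at cycle c occupies a register in slot c mod II of
        # every steady-state kernel iteration.
        for cycle in range(start, end):
            live_vector[cycle % ii] += 1
    return max(live_vector), live_vector


# Hypothetical lifetimes, chosen only to exercise the wrap-around.
example = [(0, 5), (1, 3), (2, 4)]
print(max_live(example, 2))  # prints (5, [5, 4])
```

A lifetime longer than II contributes more than once to the same slot, which is how overlapping iterations raise the register requirement in this sketch.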

B. R. Rau, M. Lee, P. Tirumalai, and M. S. Schlansker. Register allocation for software pipelined loops. In Proceedings of the 
ACM SIGPLAN '92 Conference on Programming Language Design and Implementation, pages 283-299, June 1992. 
....previous pfld just because its load pipeline has three stages. On the contrary, our architecture adopts an ordinary waiting 
mechanism for requested data. Due to this fact, our architecture does not need serious changes in the architecture. Modulo 
scheduling on rotating register files is proposed in [RLTS92]. In rotating register files, the logical register number is separate from 
the physical register number. In this respect, rotating register files are similar to our slide-windowed registers. However, in rotating 
register files, the total number of physical registers is not increased. Therefore, long memory .... 

B.R.Rau, M.Lee, P.P.Tirumalai, and M.S.Schlansker, "Register Allocation for Software Pipelined Loops", Proc. ACM SIGPLAN '92 
Conf. on Programming Language Design and Implementation, pp283-299, 1992 
Register Allocation for Predicated Code - Eichenberger, Davidson (1995) (5 citations) 



....a framework based on cyclic interval graphs, introducing the notion of time in the register allocator paradigm. This additional 
notion of time is particularly useful for the live ranges of a loop, where live ranges may cross the boundary of an iteration. Another 
approach, investigated by Rau et al. [12], proposes a general framework for the allocation of registers in software 
pipelined loops for various code generation and hardware support schemes. The second contribution of this paper is a set 
of heuristics that reduces the register requirements by allowing non-interfering virtual registers .... 

....For register allocators based on Chaitin's graph coloring framework [9, 10], register allocation for predicated code can be 
achieved simply by using the refined interference graph instead of the conventional one. However, several register allocators 
depart from the graph coloring method [11, 12], as graph coloring methods do not provide a notion of time that is 
particularly useful for the live ranges of a loop, which may cross the boundary of an iteration. Also, nontraditional 
constraints such as the one presented in [12] to support various code generation and hardware support schemes are .... 

[Article contains additional citation context not shown here] 
....to satisfy the timing constraints, software pipelining [2], also called loop pipelining or loop folding, is required. Previously [15] we 
showed that a heuristic like list scheduling for loop pipelining is unable to satisfy the timing and resource constraints even for 
simple examples. Rau et al. [11] successfully perform register binding tuned to pipelined loops. They mention that for better 
code quality "Concurrent scheduling and register allocation is preferable", but for reasons of run time efficiency they solve the 
problem of scheduling and register binding in separate phases. Some .... 

B.R. Rau, M. Lee, P.P. Tirumalai and M.S. Schlansker, "Register allocation for software pipelined loops", Proc. of the SIGPLAN 
92 conf. on Programming language design and implementation, pp. 283-299, June 1992 
....Huff's Slack Scheduling [9], Wang, Eisenbeis, Jourdan and Su's FRLC [23], and Gasperoni and Schwiegelshohn's modified list 
scheduling [6]. Experimental results show that the method described in this paper performed significantly better than these 
methods. 1 Introduction. Software pipelining [1, 4, 3, 11, 12, 13, 17, 18, 22] has been proposed as an efficient method for 
loop schedul.... 
This work was supported by research grants from NSERC (Canada) and MICRONET Network Centers of 
Excellence (Canada). To appear in the Proceedings of the 27th Annual International Symposium on Microarchitectures 
(MICRO 27), San Jose,.... 

....in Section 7. 2 Exploiting the Space of Software Pipelined Schedules. 2.1 An Example. We introduce the notion of rate 
optimal schedules under resource constraints, and illustrate how to search among them for the ones which optimize the 
register usage with the help of a simple example loop taken from .... The loop L (in the C language) is: for (i = 0; i < n; i++) { s = 
s a[i] a[i] s s a[i] } The dependence graph for the loop L is depicted in Figure 1. 
[Figure 1: Dependence Graph of Loop L, with nodes S0 to S5] 
Consider an architecture with 3 pipelined homogeneous function units. Assume .... 
....is rarely gathered and exploited in the optimizer's strategy. There are isolated instances where this information is used to 
good effect, such as when combining instruction scheduling and register allocation [3, 5, 6, 19, 20, 30, 33, 34] or software 
pipelining and register allocation [16, 17, 21, 23, 27, 32, 38, 44]. While these techniques can improve program performance, 
they focus narrowly on the interaction of a single pair of optimizations, rather than more generally on the entire collection of 
optimizations to be applied to a program. Provided that enough useful information can be gathered and analyzed .... 

....of balance among the levels of demand for specific machine resources of particular interest to the two phases, and the supply 
and configuration of the target machine's resources. The most well known examples of this work focus on the interactions 
between software pipelining and register allocation [16, 17, 21, 23, 27, 32, 38, 44], instruction scheduling and register 
allocation [3, 5, 6, 19, 20, 30, 33, 34], instruction scheduling and cache usage [28], and scalar replacement and register 
allocation [8]. All have in common the goal of creating a good match between the program characteristics, such as 
instruction placement .... 
Non-Consistent Dual Register Files to Reduce Register Pressure - Llosa, Valero, Ayguade (1995) (8 citations) 

....Chaitin's technique based on graph coloring [14]. Register allocation for software pipelined loops presents additional problems 
leading to unconventional solutions. How to allocate registers for modulo scheduled loops is beyond the scope of this 
paper (for an extensive discussion of the problem see ....). The Wands Only strategy combined with the First Fit 
allocation schema have been chosen to allocate registers. Wands Only is the strategy that has the lowest empirical 
complexity, and the one that obtains the more optimal results in terms of number of registers. For this strategy all the .... 

....M3 and A4. The results of M3 are used by operation A4; since A4 has been scheduled in the left cluster, the values produced by M3 
could be allocated as left only values. The results of A4 are used by operation M5; since M5 has been scheduled .... 

Footnote 5: We have chosen this example because it is very simple to calculate the registers required by the schedule. For an 
extensive discussion of the register allocation problem for software pipelined loops see [15]. 

VALUE       L1  L2  M3  A4  M5  A6 
Allocation  GL  LO  LO  RO  RO  RO 
Lifetime    13   7   6   6   6   4 
Table 3: Allocation requirements of values for example loop 

B.R. Rau, M. Lee, P. Tirumalai, and M. Schlansker. Register allocation for software pipelined loops. In Proceedings of the ACM 
SIGPLAN '92 Conference on Programming Language Design and Implementation, pages 283-299, June 1992. 



Constraint Driven Approach To Loop Pipelining And.. - Mesman, Strik.. (1998) 

....to satisfy the timing constraints, software pipelining [2], also called loop pipelining or loop folding, is required. Previously [15] we 
showed that a heuristic like list scheduling for loop pipelining is unable to satisfy the timing and resource constraints even for 
simple examples. Rau et al. [11] successfully perform register binding tuned to pipelined loops. They mention that for better 
code quality "Concurrent scheduling and register allocation is preferable", but for reasons of run time efficiency they solve the 
problem of scheduling and register binding in separate phases. Some .... 

B.R. Rau, M. Lee, P.P. Tirumalai and M.S. Schlansker, "Register allocation for software pipelined loops", Proc. of the SIGPLAN 
92 conf. on Programming language design and implementation, pp. 283-299, June 1992 



A Scalar Architecture for Pseudo Vector Processing based on.. - Nakamura Hiroshi (1 citation) 

....three stages. Compared with the i860 architecture, our architecture includes an ordinary waiting mechanism for requested data and 
successfully closes the growing gap between processor and memory speed without serious changes in the architecture. Modulo 
scheduling on rotating register files is proposed in [RLTS92]. In rotating register files, the logical register number is separate from 
the physical register number. This is similar to our slide-windowed registers. However, in rotating register files, the total number of 
physical registers is not increased. Therefore, long memory access latency cannot be hidden. This .... 

B.R. Rau, M. Lee, P.P. Tirumalai, and M.S. Schlansker, "Register Allocation for Software Pipelined Loops", Proc. ACM SIGPLAN '92 
Conf. on Programming Language Design and Implementation, pp283-299, 1992 



Array Data Flow Analysis for Load-Store Optimizations in.. - Bodik, Gupta (1995) (2 citations) 
No context found. 

B. R. Rau, M. Lee, P. P. Tirumalai, M. S. Schlansker, "Register Allocation for Software Pipelined Loops," Proc. of the SIGPLAN 
Conference on Programming Language Design and Implementation, San Francisco, California, pages 212-223, June 1992. 